feat: add clickhouse-bench with auto-downloaded ClickHouse binary #6736
fastio wants to merge 13 commits into vortex-data:develop
Conversation
why do we need this file, is it different from the already included one?
Good catch! I have removed the duplicate clickbench_clickhouse_queries.sql and validated with cargo check -p vortex-bench.
myrrc left a comment
I don't think downloading untrusted binaries from internet via a build script is a good idea. We want first-class integration with duckdb thus we need to download its sources (although I'd not do it in build script as well), but we don't need such integration with Clickhouse yet.
My idea is to use the clickhouse binary in CI (as it runs on Linux only) and require users to download it by hand if they want a local run. Benchmarking on macOS doesn't make much sense anyway, as the vectorized instruction set is different.
Agreed — removed the binary download from build.rs entirely. The clickhouse binary is now resolved at runtime: via the CLICKHOUSE_BINARY env var or from $PATH. CI installs it via the official installer before building. Local users need to install it manually. No more untrusted binary downloads in the build script.
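A minimal sketch of that runtime resolution order (env var first, then a $PATH probe). The helper name and error text are illustrative, not the PR's actual code; the override parameter stands in for reading `CLICKHOUSE_BINARY`:

```rust
use std::env;
use std::path::PathBuf;
use std::process::Command;

/// Resolve the `clickhouse` binary: prefer an explicit override (the
/// CLICKHOUSE_BINARY env var in the real crate), otherwise probe $PATH.
/// This is a sketch, not the code from this PR.
fn resolve_clickhouse_binary(env_override: Option<String>) -> Result<PathBuf, String> {
    if let Some(path) = env_override.or_else(|| env::var("CLICKHOUSE_BINARY").ok()) {
        return Ok(PathBuf::from(path));
    }
    // Probe $PATH by asking the multi-tool binary for its version.
    let found = Command::new("clickhouse")
        .args(["local", "--version"])
        .output()
        .map(|out| out.status.success())
        .unwrap_or(false);
    if found {
        Ok(PathBuf::from("clickhouse"))
    } else {
        Err("clickhouse binary not found: set CLICKHOUSE_BINARY or install it on $PATH".to_string())
    }
}
```

Failing loudly when neither source resolves keeps the "no untrusted downloads" property: the build never fetches anything on the user's behalf.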
.github/workflows/sql-benchmarks.yml (outdated)

```yaml
- name: Install ClickHouse
  if: contains(matrix.targets, 'clickhouse:')
  run: |
    curl https://clickhouse.com/ | sh
```
Why not download the latest release file for our architecture from Github releases? We then don't need any installation and curl in general.
Good call — updated CI to download the static binary directly from GitHub Releases, pinned to LTS release v25.8.18.1. No `curl | sh` or installation step needed.
```rust
        return query.to_string();
    }

    strip_simple_identifier_quotes(query)
```
Clickhouse does handle quoted identifiers correctly so I think we can pass them through to reduce this PR's diff.
The changes look good to me conceptually, let's see what the CI run says.
Force-pushed from 82c11d1 to 60aa2d2
Introduce a new clickhouse-bench benchmark crate that runs ClickBench queries against Parquet data via clickhouse-local, providing a baseline for comparing Vortex performance against ClickHouse.

Key design decisions:

- build.rs auto-downloads the full ClickHouse binary (with Parquet support) into target/clickhouse-local/, similar to how vortex-duckdb downloads the DuckDB library. This eliminates manual install steps and avoids issues with slim/homebrew builds lacking Parquet support.
- The binary path is baked in via the CLICKHOUSE_BINARY env var at compile time; the CLICKHOUSE_LOCAL env var allows runtime override.
- ClickHouse-dialect SQL queries are maintained in a separate clickbench_clickhouse_queries.sql file (43 queries).
- CI workflows updated to include the clickhouse:parquet target in ClickBench benchmarks and conditionally build clickhouse-bench.

Signed-off-by: fastio <pengjian.uestc@gmail.com>
…dling Signed-off-by: fastio <pengjian.uestc@gmail.com>
…use from PATH

- Remove reqwest-based binary download from build.rs
- Resolve clickhouse binary via CLICKHOUSE_BINARY env var or $PATH at runtime
- Add CI step to install clickhouse before building when needed
- Fail with clear error message if binary is not found locally

Signed-off-by: fastio <pengjian.uestc@gmail.com>
- Pass subcommand arg to clickhouse-bench in run-sql-bench.sh for consistency
- Use BenchmarkArg + create_benchmark() in main.rs like other engines
- Replace `which` with `clickhouse local --version` for binary verification
- Pin ClickHouse to LTS release v25.8.18.1 from GitHub Releases

Signed-off-by: fastio <pengjian.uestc@gmail.com>
…identifier handling. Queries are now returned as-is without dialect-specific transformation.

Signed-off-by: fastio <pengjian.uestc@gmail.com>
Signed-off-by: fastio <pengjian.uestc@gmail.com>
Force-pushed from 60aa2d2 to 5aa201a
Signed-off-by: Peng Jian <pengjian.uestc@gmail.com>
Merging this PR will degrade performance by 20.9%
Performance Changes
@fastio Feel free to ping us in the public slack channel if you want us to run CI for you! (Feel free to ping me here as well)

Edit: To fix the CI issues right now, could you update the lockfile?
Signed-off-by: fastio <pengjian.uestc@gmail.com>
@connortsui20 Thanks! I've updated the lockfile by running `cargo check`. The Cargo.lock should now be in sync. Let me know if CI still has issues after this.
@fastio you can click on the github action that is failing (in this case it is the "CI / Rust (list) (pull_request)" action) and see the problem. Let me know if you need help with this!
0ax1 left a comment
It's unclear yet whether we actually want to maintain this integration. Will loop back with the team.
connortsui20 left a comment
Before addressing some of these comments, can you fix the clippy lint? We should actually run these benchmarks to see if it even works.
```rust
let time_instant = Instant::now();

// The `clickhouse` binary is a multi-tool; invoke it as `clickhouse local`.
let mut child = Command::new(&self.binary)
    .args(["local", "--format", "TabSeparated"])
    .stdin(Stdio::piped())
    .stdout(Stdio::piped())
    .stderr(Stdio::piped())
    .spawn()
    .context("Failed to spawn clickhouse-local")?;

// Write SQL to stdin
{
    let stdin = child
        .stdin
        .as_mut()
        .context("Failed to open clickhouse-local stdin")?;
    stdin
        .write_all(full_sql.as_bytes())
        .context("Failed to write SQL to clickhouse-local stdin")?;
}
```
I don't think this is right. If you look at the other benchmark engine setup, we do not open an entire new process and write to stdin, which means this engine is going to have wildly different characteristics to the rest of the engines.
I also don't know of a good way to solve this since I have not worked much with Clickhouse before. Do you have any ideas how to make this more consistent with the other engines?
No but we essentially do this. DuckDB we close and re-open the database. DataFusion we start a new session. etc.
I don't believe this is true (or not true anymore?). In our benchmarks/datafusion-bench/src/main.rs and benchmarks/duckdb-bench/src/lib.rs we create the connections once and only time the query.
That's a fair point — the current timing does include process startup overhead. This is a constraint of clickhouse-local's non-interactive mode, which reads all of stdin before executing any queries, so I can't keep a persistent connection the way DuckDB and DataFusion do.
That said, the process startup cost is relatively small compared to query execution time (especially for ClickBench queries). If we want to isolate it, I could run a no-op query first and subtract the baseline, or we could accept the small overhead as part of the "no caching" trade-off. Let me know what you'd prefer.
You're right that DuckDB and DataFusion create connections once and time only the query execution. The per-query process spawning is an architectural constraint of clickhouse-local — in piped stdin mode, it reads all of stdin before executing anything, which makes a persistent-process model impossible (I've documented this in the module-level docs).
That said, the overhead is minimal: process spawn is ~1ms, and the CREATE VIEW statements are lightweight references to Parquet files via file(). For ClickBench queries running 100ms+, this adds <1% overhead. Happy to explore alternatives like --queries-file if you think it's worth it, but I believe the current approach is pragmatically fine.
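The "run a no-op query first and subtract the baseline" idea floated above could look like the following sketch. This is not the PR's code; `binary` is assumed to resolve to a working `clickhouse` multi-tool binary, and the sample count is an arbitrary choice:

```rust
use std::process::{Command, Stdio};
use std::time::{Duration, Instant};

/// Estimate clickhouse-local's fixed startup cost by timing a no-op
/// `SELECT 1` a few times and averaging. Sketch only, not this PR's code.
fn startup_baseline(binary: &str, samples: u32) -> std::io::Result<Duration> {
    let mut total = Duration::ZERO;
    for _ in 0..samples {
        let start = Instant::now();
        Command::new(binary)
            .args(["local", "--query", "SELECT 1"])
            .stdout(Stdio::null())
            .stderr(Stdio::null())
            .status()?;
        total += start.elapsed();
    }
    Ok(total / samples)
}

/// Subtract the startup baseline from a measured query time, clamping at zero
/// so noise in the baseline can never produce a negative duration.
fn corrected(query_time: Duration, baseline: Duration) -> Duration {
    query_time.saturating_sub(baseline)
}
```

For a 100 ms+ ClickBench query and a ~1 ms spawn cost, the correction is well under the run-to-run noise floor, which is why accepting the overhead uncorrected is also defensible.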
```rust
// Build the full SQL: setup views + the actual query
let mut full_sql = String::new();
for stmt in &self.setup_sql {
    full_sql.push_str(stmt);
    full_sql.push('\n');
}
```
This also seems strange to me, if this is setup work then why are we timing it in our measurement?
The setup SQL (CREATE VIEW statements) is indeed being timed along with the actual query. I'll separate the setup phase from the measurement phase to be consistent with how DuckDB and DataFusion handle this.
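Because views cannot persist across clickhouse-local processes, one way to separate the phases is to keep the payload builder pure and subtract the wall time of a setup-only run. A sketch under those assumptions, not this PR's implementation:

```rust
use std::time::Duration;

/// Build the stdin payload: setup view statements first, then the query.
/// Mirrors the loop quoted above, extracted as a pure function.
fn build_full_sql(setup: &[&str], query: &str) -> String {
    let mut sql = String::new();
    for stmt in setup {
        sql.push_str(stmt);
        sql.push('\n');
    }
    sql.push_str(query);
    sql
}

/// Approximate query-only time: run the setup alone, run setup plus query,
/// and take the difference (clamped at zero against measurement noise).
fn query_only_time(setup_plus_query: Duration, setup_only: Duration) -> Duration {
    setup_plus_query.saturating_sub(setup_only)
}
```

The two durations would come from timing two separate clickhouse-local invocations, one fed the setup alone and one fed `build_full_sql`'s output.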
```sh
# ClickHouse-bench only runs for local benchmarks (clickhouse-local reads local files).
if ! $is_remote && [[ "$has_clickhouse" == "true" ]] && [[ -f "target/release_debug/clickhouse-bench" ]]; then
```
If you look above at the lance setup code, we have a better guard against benching something on remote that is not supposed to be, can you mimic that?
Nice catch — the `^clickhouse:` anchor only matched when clickhouse was the first target in the comma-separated string. Dropped the `^` from both the remote guard and the has_clickhouse detection to match the lance pattern.
```toml
[dependencies]
anyhow = { workspace = true }
clap = { workspace = true, features = ["derive"] }
tokio = { workspace = true, features = ["full"] }
```
I doubt you need "full" features here
Good catch — removed features = ["full"] from tokio. The benchmark only uses tokio::runtime::Runtime for one async call (generate_base_data), so workspace-level features are sufficient.
```rust
fn queries_file_path(&self) -> PathBuf {
    if let Some(file) = &self.queries_file {
        return file.into();
    }
    let manifest_dir = PathBuf::from(env!("CARGO_MANIFEST_DIR"));
    manifest_dir.join("clickbench_queries.sql")
}
```
This refactoring seems somewhat unnecessary
Could you clarify what you're referring to here? This file (vortex-bench/src/clickbench/benchmark.rs) has no changes in the current PR — the queries_file_path() method and queries_file field already exist on develop. This comment may have been targeting an earlier revision that was rebased out. Let me know if there's something specific you'd like addressed!
Hi @0ax1, this PR is Phase 1 of the plan discussed in #6425 with @gatesn. The idea was to first integrate ClickHouse into the ClickBench benchmarking framework.

As @gatesn put it: "I think we should also look to add Clickbench to the benchmarking framework first, even without support for Vortex. That will give us some sanity checks on the PR that we are doing things […]"

This PR intentionally has a narrow scope — it only adds ClickHouse as a benchmark engine running against Parquet via clickhouse-local. The maintenance burden is minimal: one benchmark crate with no custom […]
Just to confirm, I'm happy to go ahead and merge this. Sounds like there are some open questions about where to put the benchmark timings. Also it's worth making sure the caching behavior is consistent with our other benchmarks. We haven't actually documented precisely what this means yet, but it typically means no caching at all, even for things like footers.
Signed-off-by: Peng Jian <pengjian.uestc@gmail.com>
Thanks for confirming the merge! On the two points:

Benchmark timings: Could you clarify what you have in mind? Are you referring to where the timing results should be reported (e.g., stdout, a results file, CI artifacts), or how the timing logic is organized in code? Happy to follow whatever convention you prefer.

Caching behavior: Since clickhouse-local spawns a fresh process per query, there's no cross-query caching at all — no metadata cache, no footer cache, no warm-up effects. This is actually stricter than DuckDB's current setup (which has […])
@gatesn Thanks for confirming! On caching: the per-process model is actually the most cache-free approach of all our engines. Each query spawns a fresh […]

The only shared layer is the OS page cache, which affects all engines equally. So the caching behavior is at least as strict as our other benchmarks (arguably stricter than DuckDB, which enables […])

On timing placement — currently results go to […]
Remove ^ anchor from clickhouse grep patterns to match targets regardless of position in the comma-separated string, consistent with the lance guard pattern. Signed-off-by: fastio <pengjian.uestc@gmail.com>
Force-pushed from 525d1d0 to 42a591c